This article proposes a stereo matching algorithm for the correspondence process used in
applications such as augmented reality, autonomous vehicle navigation, and surface reconstruction.
The proposed framework is built from a series of functions and produces a disparity map, which
carries the information needed for depth estimation. The input to the framework is a stereo pair
consisting of a left image and a right image. The algorithm has four steps: matching cost
computation using the census transform, cost aggregation using a segment-tree, optimization using
the winner-takes-all (WTA) strategy, and post-processing using a weighted median filter. On the
standard Middlebury benchmarking evaluation system, the resulting disparity maps have low average
errors of 9.68% for the nonocc attribute and 18.9% for the all attribute. On average, the algorithm
is competitive with the other available methods in the benchmark system.

1. INTRODUCTION


A stereo matching algorithm solves the correspondence problem between the two images of a stereo
pair. It is a widely used technology, especially in depth estimation and cyber-physical systems
(CPS). This article explains the stereo matching algorithm used in the matching process to generate
a disparity map for those applications. The disparity map contains depth information that is used
in depth estimation and CPS applications such as object detection [1], 3D surface reconstruction
[2], [3], motion systems [4], face detection [5], and industrial automation [6]. The stereo matching
algorithm determines the quality of each output and has a direct impact on the accuracy of 3D
surface reconstruction. It also determines how the estimated location, measurement, and pixels of
each stereo image are calculated. Applications that use stereo images are not easy to handle, yet
stereo matching is crucial for machine vision systems that aim to provide output similar to human
vision. According to a study on the use of zooming images [7], machine vision resembles human
vision in that the disparity between the two views increases as an item moves closer to the viewer.
In other words, the matching procedure needs a robust framework and algorithm to generate an
accurate disparity map.

To deliver accurate results, the algorithm framework must estimate depth precisely from the
disparity map. Therefore, this article proposes a new framework that aims to reduce noise and
provide accurate depth estimation. The resulting map is also very important for 3D surface
reconstruction [8]. Most studies on matching algorithms depend on the four main processes proposed
by Rhemann et al. [9]: step 1 is matching cost computation, step 2 is cost aggregation, step 3 is
disparity optimization (choosing the normalization value with the level of disparity), and step 4
is disparity refinement (post-processing used to fine-tune the final disparity map). Following
these basic steps does not guarantee accurate results, and each method has its pros and cons. Some
methods run quickly but have low precision at edges owing to an incorrect choice of window size.
Obtaining correct results with a local method remains difficult [10]. Various algorithms have been
proposed and evaluated on the standard dataset of the Middlebury stereo evaluation online system
[11] in order to reduce the mismatching rate. The dataset comprises stereo images captured under
varied ambient illumination and multiple exposures, with and without a mirror sphere that records
the lighting conditions. Algorithms that used the Middlebury datasets as their benchmark include
optimization for plane-based stereo [12], a memory-efficient and robust method [13], an adaptive
cross guided filter with weights [14], and edge-based disparity map estimation [15].

This article presents a new stereo matching algorithm based on the census transform. Its objective
is to analyze the algorithm's performance on low-texture regions, repetitive patterns, and
discontinuity regions. The first stage of the proposed algorithm is matching cost computation using
the census transform. The next stage reduces the noise that remains after cost computation by using
the segment-tree method [16], which has a significant impact on the speed and accuracy of a stereo
system. The proposed algorithm is structured similarly to a local method. As a result, the
winner-takes-all (WTA) strategy is used in the optimization stage. The last stage of the proposed
algorithm uses weighted median filtering, a non-linear operation that selects the median value and
produces filtered images with sharper edges. The rest of this article is arranged as follows. The
next section explains the methodology used to develop the proposed algorithm. The third section
presents the experimental analysis of the disparity maps using qualitative and quantitative
measurements evaluated on the Middlebury benchmark. The last section presents the conclusion and
acknowledgment that complete the article.


2. RESEARCH METHOD 

Figure 1 presents the framework and all of the methods used in the proposed algorithm. The first
step of the stereo matching algorithm is cost computation, which produces a preliminary disparity
map by matching the left and right input images. This step uses the census transform, which
converts the input images into binary images. These binary images are then compared at the same
pixel locations in the left and right images. The second step minimizes noise while preserving
object edges. This article uses the segment-tree method, which can efficiently remove noise from
low-texture regions and sharpen object borders. This technique is one of the most effective
segmentation types [16] for increasing accuracy. The optimization stage uses the WTA strategy,
which normalises the floating-point cost values and selects, for each pixel of the disparity map,
the disparity with the lowest cost. WTA is fast, and its normalization function is almost accurate
according to the study by Budiharto et al. [2].

The last stage employs an edge-preserving filter known as weighted median filtering. This nonlinear
filter is used to improve and smooth the final disparity map. The pixel intensities of the image at
this stage are the final disparity values that will be used for depth estimation. Applications
normally apply the triangulation principle to the disparity value to obtain the depth estimate.
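
As a hedged illustration of the triangulation step, depth for a rectified stereo pair is commonly recovered as Z = f * B / d, where f is the focal length in pixels, B is the camera baseline, and d is the disparity. The function name and the camera values below are hypothetical and are not taken from this article.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for one pixel of a rectified stereo pair (Z = f * B / d)."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 1000 px focal length and 0.2 m baseline.
near = depth_from_disparity(100, 1000.0, 0.2)  # large disparity, close object
far = depth_from_disparity(10, 1000.0, 0.2)    # small disparity, distant object
```

Larger disparities map to smaller depths, which is why bright pixels in a disparity map correspond to nearby objects.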
Figure 1. The proposed algorithm


2.1. Matching cost computation 

The first stage of the proposed algorithm uses the census transform, which produces the preliminary
data from the stereo images needed to generate the disparity map. This stage is important because
it calculates the corresponding points of the stereo images and can therefore avoid unnecessary
noise. Most problems in the matching process occur in low-texture and repetitive regions, where the
error must be kept to a minimum. The census transform is a non-parametric transform [17] based on
the relative intensity ordering of the input image pixels, which makes it resilient to variations
in intensity. Let C(P) denote the census transform of a pixel P. C(P) converts the local
neighborhood of P into a bit string that represents the set of neighboring pixels whose intensity
is less than that of P. The census transform is defined by (1).


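$$ C(P) = \bigotimes_{[i,j] \in D} \xi(P, P + [i,j]) \tag{1} $$

where $\otimes$ denotes concatenation, $D$ is the non-parametric window around $P$, and $\xi$ is the
transform defined by (2).

$$ \xi(P, P + [i,j]) = \begin{cases} 1, & \text{if } P > P + [i,j] \\ 0, & \text{otherwise} \end{cases} \tag{2} $$
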
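
The census transform in (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the 3x3 default window, the border clamping, and the use of the Hamming distance to compare two census strings are assumptions, since the article does not specify them.

```python
def census_transform(img, radius=1):
    """Return a census bit string (stored as an int) for every pixel of a 2-D list."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    # Clamp coordinates at the image border (an assumption).
                    ny = min(max(y + dy, 0), h - 1)
                    nx = min(max(x + dx, 0), w - 1)
                    # xi = 1 when the centre is brighter than the neighbour, as in (2).
                    bits = (bits << 1) | (1 if img[y][x] > img[ny][nx] else 0)
            out[y][x] = bits
    return out


def hamming(a, b):
    """Number of differing bits between two census strings."""
    return bin(a ^ b).count("1")


def matching_cost(census_left, census_right, y, x, d):
    """Cost of matching left pixel (y, x) with right pixel (y, x - d)."""
    xr = max(x - d, 0)  # clamp at the left border (an assumption)
    return hamming(census_left[y][x], census_right[y][xr])


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
centre_code = census_transform(img)[1][1]
```

Because only intensity orderings are stored, adding a constant brightness offset to the image leaves every census string unchanged, which illustrates the resilience to intensity variation described above.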
2.2. Cost aggregation 

Cost aggregation is the second stage of the local-based algorithm in the proposed framework. This
stage reduces the noise left by the first stage. According to the early literature, selecting an
appropriate and robust filtering technique at this step can remove a large amount of noise: almost
70% of the noise can be removed, which increases the accuracy. For this reason, the segment-tree
[18] is used at this stage. The segment-tree is a type of image segmentation, and it is efficient at
increasing accuracy. It is also used because brighter intensity levels equate to greater
contributions. The segment-tree is built in three steps that are carried out on the reference colour
or intensity image. First, the pixels are grouped into a set of segments. Second, a tree graph is
generated for each segment. Finally, these independent segment graphs are connected to form the
segment-tree structure. The reference image is modeled as a graph G = (V, E), and a subset of edges
E' of E is chosen for the segment-tree T = (V, E'), as given by (3), (4), and (5). Step 1
(initialization):


$$ w_e = w(s, r) = |I(s) - I(r)| \tag{3} $$

The color or intensity reference image $I$ is represented as a connected, undirected graph
$G = (V, E)$, where each node in $V$ corresponds to a pixel in $I$ and each edge in $E$ connects two
adjacent pixels. The weight of an edge $e$ linking pixels $s$ and $r$ is defined by (3). The edges in
$E$ are sorted in non-decreasing order of weight, and a subtree is formed for each node in $V$. At
this point $E'$ has no edges. Step 2 (grouping):


$$ w_{e_j} \le \min\left( \mathrm{Int}(T_p) + \frac{k}{|T_p|},\ \mathrm{Int}(T_q) + \frac{k}{|T_q|} \right) \tag{4} $$

In a full scan of the edge set $E$, the subtrees are merged into larger groups. Let $v_p$ and $v_q$
be the nodes connected by edge $e_j \in E$. If $v_p$ and $v_q$ belong to distinct subtrees and the
edge weight $w_{e_j}$ meets the criterion in (4), $T_p$ and $T_q$ are combined into a new subtree
$T_{p,q}$, and $e_j$ is added to $E'$. The criterion considers the relative dissimilarity of the two
subtrees: $\mathrm{Int}(T_p)$ denotes the highest edge weight in $T_p$, $|T_p|$ the number of nodes in
$T_p$, and $k$ is a constant parameter. After every edge in $E$ has been visited, each subtree
corresponds to a visually coherent segment. The edges of these subtrees (already gathered in $E'$)
are then removed from $E$. Step 3 (linking):


$$ T_{p,q} = (V_{p,q}, E_{p,q}) \tag{5} $$

In this step, more edges are chosen from $E$ to connect the subtrees by searching for them in a
subsequent scan of $E$. If an edge connects two different subtrees, the subtrees are merged and the
edge is added to $E'$. The search ends when all subtrees have been combined into a single component.
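
The three construction steps can be sketched with a union-find structure, as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name, the 4-connected pixel grid, and the value of k are hypothetical, and only the construction of the edge set E' is shown, not the cost aggregation that runs on the finished tree.

```python
def build_segment_tree(img, k=30.0):
    """Return the edge set E' (weight, s, r) of a segment-tree over a 2-D intensity image."""
    h, w = len(img), len(img[0])

    # Step 1 (initialization): 4-connected edges weighted by |I(s) - I(r)|, sorted by weight.
    edges = []
    for y in range(h):
        for x in range(w):
            s = y * w + x
            if x + 1 < w:
                edges.append((abs(img[y][x] - img[y][x + 1]), s, s + 1))
            if y + 1 < h:
                edges.append((abs(img[y][x] - img[y + 1][x]), s, s + w))
    edges.sort()

    parent = list(range(h * w))
    size = [1] * (h * w)          # |T|: number of nodes in each subtree
    max_w = [0.0] * (h * w)       # Int(T): largest edge weight inside each subtree

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a, b, wt):
        if size[a] < size[b]:
            a, b = b, a
        parent[b] = a
        size[a] += size[b]
        max_w[a] = max(max_w[a], max_w[b], wt)

    tree_edges, remaining = [], []

    # Step 2 (grouping): merge subtrees whose connecting edge satisfies criterion (4).
    for wt, s, r in edges:
        a, b = find(s), find(r)
        if a != b and wt <= min(max_w[a] + k / size[a], max_w[b] + k / size[b]):
            union(a, b, wt)
            tree_edges.append((wt, s, r))
        else:
            remaining.append((wt, s, r))

    # Step 3 (linking): connect the segments with the remaining edges.
    for wt, s, r in remaining:
        a, b = find(s), find(r)
        if a != b:
            union(a, b, wt)
            tree_edges.append((wt, s, r))
    return tree_edges


# Two flat regions separated by a strong intensity edge.
image = [[10, 10, 200, 200], [10, 10, 200, 200]]
tree = build_segment_tree(image, k=30.0)
```

In this toy image the two flat regions become separate segments in step 2, and step 3 joins them with a single high-weight edge, so the result is a spanning tree of the pixel grid.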


2.3. Disparity map optimization 

Disparity optimization is the third stage in building the disparity map. The aggregated costs from
the previous stage are processed with the WTA strategy [19]: for each pixel, the disparity with the
minimum aggregated cost CA(x, y, d) is selected and assigned to that position. The WTA strategy is
given by (6).

$$ d(x, y) = \arg\min_{d \in D} CA(x, y, d) \tag{6} $$

where $D$ is the disparity range of the image, $d(x, y)$ is the disparity chosen at coordinates
$(x, y)$, and $CA(x, y, d)$ is the output of the cost aggregation step. The disparity map still
contains noise or erroneous pixels after this stage, so it must be refined and the remaining noise
removed in the last step.
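
A minimal sketch of the WTA selection in (6) is shown below. The nested-list layout `cost_volume[y][x][d]` is an assumption made for illustration, and ties are resolved in favour of the smallest disparity.

```python
def winner_takes_all(cost_volume):
    """Map cost_volume[y][x][d] to a disparity map d(x, y) = argmin_d CA(x, y, d)."""
    return [
        [min(range(len(costs)), key=costs.__getitem__) for costs in row]
        for row in cost_volume
    ]


# Tiny 2x2 cost volume with three disparity candidates per pixel.
costs = [
    [[5, 2, 9], [1, 4, 3]],
    [[7, 7, 0], [3, 3, 3]],
]
disparity = winner_takes_all(costs)
```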


2.4. Post processing 

The last stage of the algorithm is the disparity map refinement, or post-processing, stage. It
consists of several consecutive procedures: handling the occlusion regions, filling in the invalid
pixels, and filtering the final disparity map. The procedure starts with a left-right consistency
check, continues by filling in the invalid pixels, and ends with filtering using the weighted median
(WM) filter with bilateral weights. The check compares the left-reference disparity map with the
right-reference disparity map. Values that are inconsistent between the two are marked as invalid
disparities. The consistency condition is given by (7).

$$ |d_{LR}(p) - d_{RL}(p - d_{LR}(p))| \le \tau_{LR} \tag{7} $$

where $d_{LR}$ and $d_{RL}$ are the left-reference and right-reference disparity maps and
$\tau_{LR}$ is the consistency threshold.


The next step fills in the invalid pixels, with the left image taken as the reference image. The
filling procedure scans from the left and then from the right. Each invalid disparity is replaced by
one of the nearest valid disparity values on the same scan line, as given by (8).

$$ d(p) = \begin{cases} d(p - i), & \text{if } d(p - i) \le d(p + j) \\ d(p + j), & \text{otherwise} \end{cases} \tag{8} $$

where $d(p)$ is the disparity value at location $p$, $(p - i)$ is the location of the first valid
disparity on the left side, and $(p + j)$ is the location of the first valid disparity on the right
side. This filling method produces undesirable streak artefacts. The weighted median filter with
bilateral weights is therefore used to remove the remaining noise from the disparity map. The
bilateral filter B(p, q) is given by (9).


$$ B(p, q) = \exp\left( -\frac{|p - q|^2}{\sigma_s^2} \right) \exp\left( -\frac{|d(p) - d(q)|^2}{\sigma_c^2} \right) \tag{9} $$

where $p$ and $q$ are the pixels of interest, $|p - q|$ is the spatial Euclidean distance, and
$|d(p) - d(q)|^2$ is the squared Euclidean distance between the values at the two pixels. The
spatial distance and colour parameters are $\sigma_s^2$ and $\sigma_c^2$, respectively. Higher
weights are applied by this filter because it is an edge-preserving filter, which improves the
accuracy of the disparity map. The weights B(p, q) are accumulated into the histogram h(p, d_r),
resulting in (10).


$$ h(p, d_r) = \sum_{q \in w_p \,|\, d(q) = d_r} B(p, q) \tag{10} $$

where $d_r$ is a disparity value in the disparity range and $w_p$ is the window of size $r \times r$
centred at pixel $p$. The median value of $h(p, d_r)$, given by (11), determines the final disparity
value WM.

$$ WM = \mathrm{med}\{ d \mid h(p, d_r) \} \tag{11} $$
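
The left-right check in (7) and the filling rule in (8) can be sketched for a single scan line as follows. This is a hedged illustration: the `INVALID` marker, the threshold value, and the treatment of matches that fall outside the image are assumptions that the article does not specify.

```python
INVALID = -1  # hypothetical marker for an invalid disparity


def left_right_check(d_lr, d_rl, tau=1):
    """Keep d_LR(p) only where |d_LR(p) - d_RL(p - d_LR(p))| <= tau on one scan line."""
    checked = []
    for p, d in enumerate(d_lr):
        q = p - d  # column of the matching pixel in the right image
        consistent = 0 <= q < len(d_rl) and abs(d - d_rl[q]) <= tau
        checked.append(d if consistent else INVALID)
    return checked


def fill_invalid(line):
    """Fill invalid pixels following (8): take the smaller of the first valid
    disparities found to the left and to the right on the same scan line."""
    filled = list(line)
    for p, d in enumerate(line):
        if d != INVALID:
            continue
        left = next((line[i] for i in range(p - 1, -1, -1) if line[i] != INVALID), None)
        right = next((line[j] for j in range(p + 1, len(line)) if line[j] != INVALID), None)
        if left is None and right is None:
            continue  # no valid disparity on this scan line
        if left is None:
            filled[p] = right
        elif right is None:
            filled[p] = left
        else:
            filled[p] = left if left <= right else right
    return filled


checked = left_right_check([2, 2, 2, 5, 2], [2, 2, 2, 2, 2])
filled = fill_invalid(checked)
```

Choosing the smaller neighbour favours the background disparity in occluded regions, which is also what creates the horizontal streaks mentioned in the text.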

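
The weighted median of (9) to (11) can be sketched for one pixel as below. This follows the equations as written in this section, with the bilateral weight computed from the disparity values. The median convention (the first disparity whose cumulative weight reaches half of the total) and the parameter values in the example are assumptions for illustration only.

```python
import math


def bilateral_weight(disp, p, q, sigma_s, sigma_c):
    """B(p, q) from (9) for pixel coordinates p = (y, x) and q = (y, x)."""
    (py, px), (qy, qx) = p, q
    spatial = (py - qy) ** 2 + (px - qx) ** 2
    value = (disp[py][px] - disp[qy][qx]) ** 2
    return math.exp(-spatial / sigma_s ** 2) * math.exp(-value / sigma_c ** 2)


def weighted_median(disp, p, radius, sigma_s, sigma_c):
    """Weighted median disparity at pixel p over a (2*radius+1)^2 window."""
    h, w = len(disp), len(disp[0])
    py, px = p
    hist = {}  # h(p, d_r): accumulated weight per disparity value, as in (10)
    for qy in range(max(0, py - radius), min(h, py + radius + 1)):
        for qx in range(max(0, px - radius), min(w, px + radius + 1)):
            d = disp[qy][qx]
            hist[d] = hist.get(d, 0.0) + bilateral_weight(disp, p, (qy, qx), sigma_s, sigma_c)
    half = sum(hist.values()) / 2.0
    acc = 0.0
    for d in sorted(hist):
        acc += hist[d]
        if acc >= half:
            return d  # median of the weighted histogram, as in (11)
    return disp[py][px]


# A single outlier in a flat region is replaced by the surrounding disparity.
noisy = [[5, 5, 5], [5, 9, 5], [5, 5, 5]]
cleaned = weighted_median(noisy, (1, 1), radius=1, sigma_s=10.0, sigma_c=10.0)
```

With a very small colour parameter the centre pixel's own weight dominates and outliers are preserved, so the choice of the two sigma values controls how aggressively the filter smooths.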

3. RESULTS AND ANALYSIS 

The disparity maps produced by the proposed algorithm are tested using the standard benchmarking
dataset known as the Middlebury Benchmark [20]. The experiments were carried out on a personal
computer running Windows 10 with an i7-8700 CPU @ 3.2 GHz, an RTX 2070 Super GPU, and 16 GB of RAM.
The Middlebury dataset provides fifteen standard input images, and the results must be uploaded
online as a result file in .zip format. The images are quite complicated in shape, and each image
comprises a variety of attributes and properties, such as the lighting around objects, object depth,
disorganised regions, and low-texture areas. The values of {M, σs, σc, Wg, Wmr} are
{13×9, 17, 0.3, 9×9, 13×13}. In this study, performance is assessed using the all and nonocc error
attributes, reported as avgerr pixel percentages. A nonocc error is an invalid disparity value in a
non-occluded region, whereas the all error refers to erroneous disparity values across all pixels in
the image's disparity map.
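
The difference between the all and nonocc attributes can be illustrated with a small sketch that restricts an error measure to a mask. The function below computes a mean absolute disparity error and is only an assumption about the form of the metric. The official Middlebury evaluation defines the exact measure, and the function name and mask layout are hypothetical.

```python
def average_error(disp, ground_truth, mask=None):
    """Mean absolute disparity error over the pixels selected by mask (all pixels if None)."""
    total, count = 0.0, 0
    for y, row in enumerate(disp):
        for x, d in enumerate(row):
            if mask is not None and not mask[y][x]:
                continue  # skip occluded pixels when a non-occlusion mask is given
            total += abs(d - ground_truth[y][x])
            count += 1
    return total / count if count else 0.0


estimate = [[1, 2], [3, 10]]
truth = [[1, 2], [3, 4]]
nonocc_mask = [[1, 1], [1, 0]]  # the bottom-right pixel is occluded

all_error = average_error(estimate, truth)
nonocc_error = average_error(estimate, truth, nonocc_mask)
```

Because occluded pixels are usually the hardest to estimate, the all attribute is typically higher than the nonocc attribute, as in the results reported here.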

As an example of the accuracy obtained, Figure 2(a) and Figure 2(b) compare the Adirondack disparity
maps of the algorithm proposed by Fakhar et al. [21] and of the proposed algorithm. The proposed
algorithm clearly generates a more accurate result with less noise. Figure 3 depicts all of the
disparity maps that were generated and evaluated using the Middlebury dataset. The evaluation
produces fifteen images, and accuracy is measured by the all error and the non-occluded (nonocc)
error. The all error is measured over all image pixels, and the nonocc error over the pixels outside
the occluded regions of the disparity map. Among the 15 images provided, some are quite difficult to
match, such as Jadeplant and Playtable, which contain tables seen in perspective and leaves of
varying dimensions. Despite these difficulties, the proposed algorithm is capable of reconstructing
a substantially accurate disparity map with identifiable discontinuities.

Technically, the Middlebury images are difficult to match accurately at the corresponding points.
The original images may contain differences in pixel values, and they were designed to test the
robustness of an algorithm at the same corresponding points. In addition, each image contains
objects of particular colors, shadows, discontinuity regions, and occluded areas. Figure 3 shows the
results from the KITTI 2015 dataset, where the three input left reference images are labelled
#000002_10, #000003_10, and #000004_10 in the database. These images were captured in a real
environment. The disparity maps are displayed in grayscale. The results show clear depth estimation,
with objects detected almost accurately. A high grayscale intensity means the object is close to the
stereo camera, and a low grayscale value indicates that the object is far away.


Figure 2. Example disparity maps for the Adirondack image: (a) another proposed algorithm [19] and
(b) the proposed algorithm

[Figure 3 panels: Left Reference Image, Disparity Map]


Figure 3. The example shown in the disparity map shows the accuracy used in the Adirondack 


Figure 4 shows the disparity maps. The results are improved for images containing low-texture
surfaces, such as Recycle, Adirondack, and Piano, and contours with varying depth and dispersion are
clear and visible. The worst results, namely Jadeplant, Playtable, Vintage, and PianoL, show that
plain-colored objects and shadows are difficult to match, because regions with similar pixel values
make wrong matches very likely. The results uploaded to the Middlebury benchmark, together with
those of other published methods, are collected as quantitative measurements in Table 1 and Table 2.
This comparison shows the competitiveness and effectiveness of the proposed algorithm, which ranks
second in both tables with 9.68% nonocc error and 18.9% all error on average. In Table 1 (nonocc
error), disparity stereo geometry cross aware (DSGCA) ranks third, followed by pixel pair based
guided filter (PPEP-GF), sample photoconsistency (SPS), image edge brightness image (IEBIMst), and
semi-global matching 1 (SGBM1), while multi-windows noise effected (MANE) ranks lowest with 11.9%
error. In Table 2 (all error), SGBM1 ranks third, followed by MANE, double guided (DoGGuided),
dynamic filter (DF), and random-normalised cross control (R-NCC), while binary stereo matching (BSM)
ranks lowest with 23.5% error. The tables show that the proposed algorithm is competitive with other
published work. The method of Kong et al. [14] has the lowest error in both Table 1 and Table 2.
However, the proposed method still produces a good result for PianoL compared with the area cross
region guided filter (ACR-GIF-OW) [22].


Table 1. Performance comparison of nonocc error from the Middlebury images 


ACR-GIF-OW [23] 5.78 3.01 3.91 11.2 2.81 2.91 4.95 27.1 459 549 123 2.58 2.50 126 1.86 6.58
Proposed algorithm 9.68 3.94 746 185 4.66 443 5.97 21.8 7.24 692 344 132 4.20 11.8 3.91 20.1 


DSGCA [23] 975 3.25 5.95 18.9 3.60 341 717 21.1 7.23 9.36 294 7.94 3.80 14.7 3.51 39.7 
SPS [24] 10.4 3.57 5.34 22.8 3.11 3.15 9.34 229 6.78 12.5 9.70 7.64 6.27 223 1.52 52.6 
IEBIMst [25] 11.1 26.1 4.67 41.9 2.72 4.99 569 17.5 547 12.9 148 3.26 499 164 2.64 10.4 
SGBM1 [26] 11.3 18.3 745 15.7 348 291 6.51 384 5.37 12.8 13.5 3.24 344 15.1 3.00 11.1 
MANE [27] 11.9 6.58 5.81 20.7 452 431 106 209 862 150 347 105 5.50 202 3.12 465 


Table 2. Performance comparison of all error from the Middlebury images

ACR-GIF-OW [22] 9.48 4.53 841 22.1 7.93 7.88 6.36 27.7 11.0 851 161 660 4.26 13.1 2.86 7.77
Proposed algorithm 18.9 7.95 25.0 41.3 12.2 12.0 11.2 262 200 243 388 19.3 7.82 14.8 13.6 26.7 


SGBM1 [26] 18.9 21.1 17.8 38.7 11.0 364 11.6 40.0 13.6 254 20.0 8.74 5.97 17.6 10.7 18.3 
MANE [27] 21.3 116 229 45.9 124 12.3 15.1 24.7 223 31.1 399 17.3 9.67 225 12.5 51.0 
SPS [24] 22.3 20.1 28.0 565 138 168 134 37.3 23.8 30.3 30.8 13.0 9.13 190 13.4 23.6 
IEBIMst [25] 22.7 14.1 18.2 103 13.2 12.7 11.1 26.4 22.5 209 13.9 163 168 11.5 6.16 26.8 
DSGCA [28] 23.5 127 28.7 58.7 148 147 160 35.8 245 294 31.0 202 121 19.2 143 393

Figure 4. Results of the proposed algorithm evaluated on the Middlebury benchmark dataset


4. CONCLUSION 

This article presents a stereo matching framework. The filters used in the framework allow it to
remove noise and achieve the reported accuracy. Cost computation with the census transform increases
the effectiveness of the disparity map. Cost aggregation with the segment-tree reduces the noise
remaining from the previous stage and preserves object boundaries in the preliminary disparity map.
The WTA strategy used in the optimization stage further strengthens the framework by normalizing the
floating-point numbers in accordance with the disparity values. The last stage, disparity map
refinement, uses weighted median filtering to reduce residual noise and improve the final disparity
map. The framework as a whole is competitive with other works: based on the results from the
Middlebury benchmark, it obtains the second-lowest average errors, 9.68% for nonocc and 18.9% for
all. In future work, the method should be extended with additional techniques to further reduce the
remaining inaccuracies. A longer-term possibility is an optimized implementation on graphics
processing unit (GPU) architecture to improve the speed of cost computation.